Introduction

The dataset I have selected, “Using RNA sequencing to examine age-dependent skeletal muscle transcriptome response to bed rest-induced atrophy, and age independent disuse-induced insulin resistance” (GSE113165), records the transcriptome of vastus lateralis cells of YOUNG (N=9, 18-28 y) and OLD (N=18, 60-79 y) men and women before five days of bed rest (control) and after five days of bed rest (test condition). The subjects are also classified into susceptability (i.e. low, high) to disuse-induced insulin resistance. The study aims to understand gene expression associated with bed rest to offset resulting muscle loss.

In the previous assessment in this course BCB420, Assignment 1, the dataset was processed - cleaned for low counts, normalized using TMM (Trimmed Mean of M-Values) and mapped to HUGO identifiers. Duplicate HUGO ids were averaged and genes with empty/unavailable HUGO ids were removed. The initial downloaded GEO dataset contained 58051 gene ids, however the final processed dataset, contained 14969 gene ids.

Assignment 1 Contents

Dataset Information

Publication Title: Age-dependent skeletal muscle transcriptome response to bed rest-induced atrophy.
Publication Date: 2019-04-01
Publication Journal: J Appl Physiol
GEO ID: GSE113165

Install and Load Dependencies

# check to ensure all needed packages are installed
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

if (!requireNamespace("GEOquery", quietly = TRUE))
  BiocManager::install("GEOquery")

if (!requireNamespace("biomaRt", quietly = TRUE))
  BiocManager::install("biomaRt")

if (!requireNamespace("edgeR", quietly = TRUE))
  BiocManager::install("edgeR")


if (!requireNamespace("knitr", quietly = TRUE))
  BiocManager::install("knitr")

library(GEOquery)
library(biomaRt)
library(edgeR)
library(knitr)

Load the Dataset

We load our information about our datased using the GEO id, we then extract our desired information to produce our sample set with appropriate labels.

GSE113165 <- GEOquery::getGEO("GSE113165", GSEMatrix =TRUE)

show(GSE113165)
GSE113165_data <- (pData(phenoData(GSE113165[[1]])))

#Create sample set
samples <- data.frame(GSE113165_data$description.1, GSE113165_data$`subject:ch1`, GSE113165_data$`age:ch1`, GSE113165_data$`Sex:ch1`, gsub("susceptibility: ", "", GSE113165_data$characteristics_ch1.5), gsub("time: ", "", GSE113165_data$characteristics_ch1.1))
colnames(samples) <- c("id", "subject","age", "sex", "susceptibility", "time")

#Create sample set
samples <- data.frame(GSE113165_data$`subject:ch1`, GSE113165_data$`age:ch1`, GSE113165_data$`Sex:ch1`, gsub("susceptibility: ", "", GSE113165_data$characteristics_ch1.5), gsub("time: ", "", GSE113165_data$characteristics_ch1.1))
samples <- t(samples)
colnames(samples) <- GSE113165_data$description.1
rownames(samples) <- c("subject","age", "sex", "susceptibility", "time")

Expression count data is downloaded and accessed here. Formatted as needed.

#Reading in supplementary files with expression counts
sfiles <- getGEOSuppFiles('GSE113165')
fnames <- rownames(sfiles)
expCounts <- read.delim(fnames[1],header=TRUE, check.names = FALSE)
#Preview of data
kable(head(expCounts))
geneid 10876X1 10876X2 10876X3 10876X4 10876X5 10876X6 10876X7 10876X8 10876X9 14406X1 14406X2 14406X3 14406X4 14406X5 14406X6 14406X7 14406X8 14406X9 10876X10 10876X11 10876X12 10876X13 10876X14 10876X15 10876X16 10876X17 10876X18 10876X19 10876X20 10876X21 10876X22 10876X23 10876X24 10876X25 10876X26 10876X27 10876X28 10876X29 10876X30 10876X31 10876X32 10876X33 10876X34 10876X35 10876X36 14406X10 14406X11 14406X12 14406X13 14406X14 14406X15 14406X16 14406X17 14406X18 14406X19 14406X20
ENSG00000000003 49 33 51 25 54 42 30 28 61 67 51 63 49 59 72 42 51 87 33 17 14 34 38 65 67 111 93 32 41 46 39 41 30 33 40 40 39 35 51 34 47 37 25 28 62 50 24 57 72 78 54 52 128 75 105 99
ENSG00000000005 1 0 1 0 0 0 1 0 1 2 1 0 1 1 2 10 0 4 0 0 0 1 0 2 4 2 2 2 1 3 1 4 1 2 1 4 3 4 1 0 3 5 1 3 2 2 3 7 2 0 2 1 3 1 0 2
ENSG00000000419 207 224 236 102 298 167 115 109 268 352 298 359 288 278 624 279 285 404 181 143 63 185 189 299 183 449 396 90 110 212 237 202 167 175 239 194 223 217 217 235 269 244 149 277 288 267 195 192 408 345 255 229 404 271 625 401
ENSG00000000457 136 112 118 94 194 169 106 96 169 176 220 196 222 243 297 216 212 265 141 115 103 108 134 174 165 265 276 95 146 164 146 109 145 128 144 181 125 136 141 142 178 193 140 200 271 218 166 211 221 227 187 199 384 319 419 345
ENSG00000000460 23 19 26 21 39 32 20 18 45 69 68 48 59 46 65 88 98 82 30 12 24 18 16 37 32 73 76 30 34 44 28 37 36 49 53 44 35 38 36 44 48 51 38 52 66 66 39 65 75 65 56 70 165 88 120 110
ENSG00000000938 27 7 7 5 29 8 9 10 7 15 24 17 14 43 36 27 12 25 25 76 33 7 6 7 15 26 55 15 14 7 14 13 9 9 12 12 32 12 23 6 9 35 11 766 77 16 35 69 20 16 18 24 53 31 42 50

Clean the Data

Here we remove genes with low expression count to clean up the dataset.

dim(expCounts)
## [1] 58051    57
#Removing low counts
cpms <- cpm(expCounts[2:57])
rownames(cpms) <- expCounts[,1]
keep <- rowSums(cpms >1) >=3
expCounts_filtered <- expCounts[keep,]

dim(expCounts_filtered)
## [1] 16674    57

Normalization

#Original Box Plot
data2plot <- log2(cpm(expCounts_filtered[2:57]))
boxplot(data2plot, xlab = "Samples", ylab = "log2 CPM", 
        las = 2, cex = 0.5, cex.lab = 0.5,
        cex.axis = 0.5, main = "GSE113165 Samples")
abline(h = median(apply(data2plot, 2, median)), col = "green", lwd = 0.6, lty = "dashed")

We will apply TMM (Trimmed Mean of M-Values) to normalize the filtered data. As you can see, the plot (and of course the data) adjusts so all the samples have a similar mean.

#Normalization
filtered_data_matrix <- as.matrix(expCounts_filtered[2:57])
rownames(filtered_data_matrix) <- expCounts_filtered$geneid
d = DGEList(counts=filtered_data_matrix, group=samples["time",])
d = calcNormFactors(d)
normalized_counts <- cpm(d)

#Normalized Box Plot
data2plot <- log2(normalized_counts)
boxplot(data2plot, xlab = "Samples", ylab = "log2 CPM", 
        las = 2, cex = 0.5, cex.lab = 0.5,
        cex.axis = 0.5, main = "Normalized")
abline(h = median(apply(data2plot, 2, median)), col = "green", lwd = 0.6, lty = "dashed")

We can also view the data in terms of a density plot, before and after normalization. We can see that the distribution (sample lines) are tighter and more similar.

#Density Plot
counts_density <- apply(log2(filtered_data_matrix), 2, density)
#Calculate the Limits
xlim <- 0; ylim <- 0
for (i in 1:length(counts_density)) {
        xlim <- range(c(xlim, counts_density[[i]]$x)); 
        ylim <- range(c(ylim, counts_density[[i]]$y))
}

cols <- rainbow(length(counts_density))
ltys <- rep(1, length(counts_density))

#plot the first density plot to initialize the plot
par(fig=c(0,0.5,0,1), new=FALSE)
plot(counts_density[[1]], xlim=xlim, ylim=ylim, type="n", 
     ylab="Smoothing density of log2-CPM", main="Initial", cex.lab = 0.85)
#plot each line
for (i in 1:length(counts_density)) lines(counts_density[[i]], col=cols[i], lty=ltys[i])

normalized_counts_density <- apply(log2(normalized_counts), 2, density)
#Calculate the Mormalized Limits
xlim <- 0; ylim <- 0
for (i in 1:length(normalized_counts_density)) {
        xlim <- range(c(xlim, normalized_counts_density[[i]]$x)); 
        ylim <- range(c(ylim, normalized_counts_density[[i]]$y))
}

cols <- rainbow(length(normalized_counts_density))
ltys <- rep(1, length(normalized_counts_density))

#Normalized Density Plot
par(fig=c(0.50,1,0,1), new=TRUE)
plot(normalized_counts_density[[1]], xlim=xlim, ylim=ylim, type="n", 
     ylab="Smoothing density of log2-CPM", main="Normalized", cex.lab = 0.85)
for (i in 1:length(normalized_counts_density)) lines(normalized_counts_density[[i]], col=cols[i], lty=ltys[i])

Visualize the data

I didn’t use labels in this case as it crowded the MDS visual.

In this case the samples are grouped in pre and post bed rest. There seems to some clustering between -0.5 and -1.0 of the x-axis, as well as 1.0 and 1.5, but it is difficult to see if one group is more prominent than another in the grouping.

plotMDS(d, labels="o",
        col = c("darkgreen","blue")[factor(samples["time",])])

Let’s try to visualize in terms of low and high susceptibility to disuse-induced insulin resistance. It looks a bit clearer, and we can see there is a definite difference in groupings of the clusters. Meaning that those with similar susceptibility have similar expression.

plotMDS(d, labels="o",
        col = c("darkgreen","blue")[factor(samples["susceptibility",])])

Perhaps age is also a factor, let’s visualize this. This group clusering is even more distinguishable here, showing the shared expression profiles of the age groups.

plotMDS(d, labels="o",
        col = c("darkgreen","blue")[factor(samples["age",])])

HUGO Identifier Mapping

We now map the ENSEMBL gene ids to HUGO (hgnc) identifiers using biomart, particularly the getGM function.

#Adding another column to normalized_counts for ease of merging later
ensembl_labels <- data.frame(rownames(normalized_counts))
colnames(ensembl_labels) <- "ensembl_gene_id"
normalized_counts <- cbind(normalized_counts, ensembl_labels)

#Loading emsembl human data
mart <- useEnsembl(biomart = "ensembl", 
                   dataset = "hsapiens_gene_ensembl", 
                   mirror = "useast")

#Retreive HUGO symbols for each ENSEMBL ID and set as new rownames
sample_HUGO <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"), filters = "ensembl_gene_id", values = c(expCounts_filtered[1]), mart = mart)   

Some of the gene ids are not returned by getGM (depreciated or no symbol available) so these are not included in the dataset after merging with the returned ids.

#Integrating into counts
normalized_counts_annot <- merge(sample_HUGO,normalized_counts, by.x="ensembl_gene_id", by.y="ensembl_gene_id", all.y=FALSE)

In addition, some of these returned gene ids have been returned with an empty string for the HUGO identifier, and these are also removed from the dataset.

#Empty HUGO Removal
nonempty_HUGO <- normalized_counts_annot["" != normalized_counts_annot[, 2],]

I also realized that there may be duplicate HUGO ids, so we remove the replicates and only keep a mean of the expression values. This means that as a result there is no ENSEMBL id associated with these expression values, this may cause an issue later, but we can always adjust this to map to one of the duplicate ids.

#Unique HUGO symbols only
HUGO_duplicate_name <- nonempty_HUGO[duplicated(nonempty_HUGO[,2]),2]

for (x in HUGO_duplicate_name){
  dup_index <- which(x == nonempty_HUGO[,2])
  
  #Store ENSEMBL ID
  dup_ENSEMBL <- nonempty_HUGO[dup_index, 1]
  
  #Mean expression values
  replacement_dup <- data.frame(t(colMeans(nonempty_HUGO[dup_index, 3:58])))
  colnames(replacement_dup) <- gsub(colnames(replacement_dup), pattern="^X", replace="")
  
  #Save associated ids
  replacement_dup[,"ensembl_gene_id"] <- ""
  replacement_dup[,"hgnc_symbol"] <- HUGO_duplicate_name[x]
  
  #Remove duplicate rows
  removed_dup <- nonempty_HUGO[-dup_index,]
  
  #Add meaned duplicates back
  unique_counts <- rbind(removed_dup, replacement_dup)
}

#Preview into the dataset
kable(head(unique_counts))
ensembl_gene_id hgnc_symbol 10876X1 10876X2 10876X3 10876X4 10876X5 10876X6 10876X7 10876X8 10876X9 14406X1 14406X2 14406X3 14406X4 14406X5 14406X6 14406X7 14406X8 14406X9 10876X10 10876X11 10876X12 10876X13 10876X14 10876X15 10876X16 10876X17 10876X18 10876X19 10876X20 10876X21 10876X22 10876X23 10876X24 10876X25 10876X26 10876X27 10876X28 10876X29 10876X30 10876X31 10876X32 10876X33 10876X34 10876X35 10876X36 14406X10 14406X11 14406X12 14406X13 14406X14 14406X15 14406X16 14406X17 14406X18 14406X19 14406X20
ENSG00000000003 TSPAN6 5.289463 4.025953 5.7483737 3.3755463 4.052595 4.3699835 4.722525 4.079050 5.3109728 4.517023 3.174709 3.927295 2.8252135 3.221141 3.513752 2.258656 3.2873595 4.672469 3.570428 2.206914 1.638454 4.3521590 4.7169699 5.1087359 6.016743 5.454376 4.884660 4.648841 3.708616 3.8897329 3.526400 4.267296 2.9441147 3.4847624 3.779075 3.302813 3.705487 3.387705 4.479046 3.2853820 3.3742099 3.061681 2.522700 1.747284 3.236915 3.241200 1.763651 3.387475 4.070012 4.3549460 3.505752 3.132316 3.687560 2.810795 3.641845 3.529361
ENSG00000000419 DPM1 22.345283 27.327681 26.6003176 13.7722291 22.364322 17.3758868 18.103013 15.879157 23.3334541 23.731228 18.550263 22.379349 16.6053366 15.177579 30.452516 15.003933 18.3705385 21.697441 19.583254 18.564045 7.373045 23.6808651 23.4607189 23.5001852 16.433791 22.063195 20.799196 13.074864 9.949945 17.9265953 21.429661 21.024238 16.3889053 18.4798004 22.579974 16.018645 21.187782 21.003774 19.057903 22.7077877 19.3119670 20.190541 15.035290 17.285628 15.035993 17.308009 14.329662 11.410443 23.063403 19.2622613 16.554939 13.794240 11.638861 10.156338 21.677648 14.295696
ENSG00000000457 SCYL3 14.680959 13.663840 13.3001588 12.6920542 14.559324 17.5839812 16.686256 13.985313 14.7140065 11.865614 13.694825 12.218252 12.7999469 13.266733 14.494226 11.615948 13.6651023 14.232232 15.255463 14.929127 12.054343 13.8245050 16.6335256 13.6756931 14.817353 13.021707 14.496409 13.801245 13.206291 13.8677435 13.201394 11.344762 14.2298879 13.5166540 13.604671 14.945230 11.876559 13.163655 12.383245 13.7213015 12.7789224 15.970387 14.127118 12.480598 14.148452 14.131633 12.198584 12.539602 12.492677 12.6740096 12.140288 11.987134 11.062680 11.955247 14.532695 12.299289
ENSG00000000460 C1orf112 2.482809 2.317973 2.9305435 2.8354589 2.926874 3.3295112 3.148350 2.622246 3.9179307 4.651860 4.232946 2.992225 3.4017877 2.511398 3.172137 4.732423 6.3168869 4.403936 3.245843 1.557822 2.808779 2.3040842 1.9860926 2.9080497 2.873668 3.587112 3.991765 4.358288 3.075438 3.7206141 2.531774 3.850974 3.5329377 5.1743441 5.007275 3.633095 3.325437 3.678080 3.161680 4.2516709 3.4460016 4.220154 3.834503 3.244956 3.445748 4.278384 2.865932 3.862911 4.239596 3.6291217 3.635594 4.216580 4.753495 3.297999 4.162108 3.921513
ENSG00000000938 FGR 2.914602 0.853990 0.7889925 0.6751093 2.176394 0.8323778 1.416758 1.456803 0.6094559 1.011274 1.493981 1.059746 0.8072039 2.347611 1.756876 1.451993 0.7734964 1.342663 2.704869 9.866206 3.862071 0.8960327 0.7447847 0.5501716 1.347032 1.277602 2.888777 2.179144 1.266357 0.5919159 1.265887 1.353045 0.8832344 0.9503897 1.133723 0.990844 3.040399 1.161499 2.019962 0.5797733 0.6461253 2.896184 1.109988 47.800691 4.020040 1.037184 2.571991 4.100628 1.130559 0.8933223 1.168584 1.445685 1.526880 1.161795 1.456738 1.782506
ENSG00000000971 CFH 38.861361 16.713805 26.8257441 23.3587807 35.872973 34.1274902 35.104104 18.647084 40.6594144 29.866290 29.817369 27.179376 20.8143281 22.984751 27.085171 26.135883 33.9049236 35.714847 32.891211 27.651340 12.054343 27.2649960 17.2541795 28.6875171 25.324203 17.493312 19.433592 24.987518 23.789415 33.1472894 38.519137 55.787087 30.0299702 19.6413878 20.123575 47.395372 40.380301 33.489888 44.263516 42.9998532 54.4898995 45.428718 32.492372 34.072033 35.919317 25.735129 26.454761 37.797094 31.146900 26.1855088 20.580061 26.323506 15.384039 13.716679 18.590751 15.507800

Interpret and Document

What are the control and test conditions of the dataset?

Control condition: Vastus lateralis cells of YOUNG (N=9) and OLD (N=18) men and women before five days of bed rest

Test condition: Vastus lateralis cells of YOUNG (N=9) and OLD (N=18) men and women after five days of bed rest

Why is the dataset of interest to you?

This dataset stood out to me as there are no direct interventions aside from a change in physical behaviour. I usually do not encounter experimental studies that do not incorporate pharmaceutical/biochemical intervention, so I was interested to see the resulting gene expression values.

Were there expression values that were not unique for specific genes? How did you handle these?

Yes. These were removed.

#Number of duplicates
length(setdiff(nonempty_HUGO[duplicated(nonempty_HUGO[,2]), 2], unique_counts))
## [1] 1

Were there expression values that could not be mapped to current HUGO symbols?

Yes. These were removed from the final dataset. The function call to getBM excluded depreciated ids and some of those with no HUGO symbol. There were some empty HUGO symbol strings returned, these were also removed from the final dataset.

#Number of values that could not be mapped to HUGO symbols
length(setdiff(rownames(normalized_counts),nonempty_HUGO[,1]))
## [1] 1704

How many outliers were removed? Low count outliers were removed and assigned to this variable. No other outliers were removed.

#Low count outliers
dim(expCounts_filtered)
## [1] 16674    57

How did you handle replicates? I averaged duplicate HUGO genes and added them to the dataset after removing the old genes.Aside from that, there were no replicate rows in the dataset after filtering, normalization, HUGO mapping and removing duplicate HUGO genes. Otherwise, I would remove them.

table(duplicated(nonempty_HUGO))
## 
## FALSE 
## 14970

What is the final coverage of your dataset?

#Starting coverage from given expression data
dim(expCounts)
## [1] 58051    57
#Coverage following filtering for low counts
dim(expCounts_filtered)
## [1] 16674    57
#Coverage after mapping to HUGO symbols
dim(normalized_counts_annot)
## [1] 16578    58
#Coverage after filtering for nonempty and unique symbols
dim(unique_counts)
## [1] 14969    58

As we can see from the above code snippets, the initial dataset contained 58051 gene ids, however our final dataset, following all the steps of processing, contains 14969 gene ids.

References

[1] Mahmassani ZS, Reidy PT, McKenzie AI, Stubben C et al. Age-dependent skeletal muscle transcriptome response to bed rest-induced atrophy. J Appl Physiol (1985) 2019 Apr 1;126(4):894-902. PMID: 30605403

[2] Mahmassani ZS, Reidy PT, McKenzie AI, Stubben C et al. Disuse-induced insulin resistance susceptibility coincides with a dysregulated skeletal muscle metabolic transcriptome. J Appl Physiol (1985) 2019 May 1;126(5):1419-1429. PMID: 30763167

[3] Isserlin R, BCB420-lectures-public, (2020), GitHub repository, https://github.com/risserlin/BCB420-lectures-public

Assignment 2 Contents

Install and Load Dependencies

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

if (!requireNamespace("GEOquery", quietly = TRUE))
  BiocManager::install("GEOquery")

if (!requireNamespace("biomaRt", quietly = TRUE))
  BiocManager::install("biomaRt")

if (!requireNamespace("edgeR", quietly = TRUE))
  BiocManager::install("edgeR")


if (!requireNamespace("knitr", quietly = TRUE))
  BiocManager::install("knitr")

library(GEOquery)
library(biomaRt)
library(edgeR)
library(knitr)

if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
  BiocManager::install("ComplexHeatmap")

if (!requireNamespace("circlize", quietly = TRUE))
  install.packages("circlize")

if (!requireNamespace("dplyr", quietly = TRUE))
  install.packages("dplyr")

if (!requireNamespace("gprofiler2", quietly = TRUE))
  install.packages("gprofiler2")

library(ComplexHeatmap)
library(circlize)
library(dplyr)
library(gprofiler2)

Differential Gene Expression

We begin with a bit of adjustments to data and matrices that we will use for the analysis. From Assignment 1, I need the sample data to hold the different patient parameters in each column rather than rows. We also adjust the column names in the count data to the descriptive titles so we can access and thus compare the conditions more easily. This is all followed by the creation of matrices for furthur functions.

#Adjust sample set to desired format
samples <- data.frame(GSE113165_data$description.1, GSE113165_data$`subject:ch1`, GSE113165_data$`age:ch1`, GSE113165_data$`Sex:ch1`, gsub("susceptibility: ", "", GSE113165_data$characteristics_ch1.5), gsub("time: ", "", GSE113165_data$characteristics_ch1.1))
colnames(samples) <- c("id", "subject","age", "sex", "susceptibility", "time")

#Adding a new attribute to create specific types based on ANOVA
samples[,"type"] <- paste(samples[,3], lapply(strsplit(as.character(samples[,6]), " "), `[[`, 1), sep="_")

#Changing colnames from patient-sample ids to the descriptive title
match_index <- match(colnames(unique_counts[3:58]), GSE113165_data[,"description.1"])
colnames(unique_counts[3:58]) <- GSE113165_data[match_index,"title"]

unique_counts <- unique_counts[which(!is.na(unique_counts[,2])),]

#Data matrix for expression
expressionMatrix <- as.matrix(unique_counts[,3:58])
rownames(expressionMatrix) <- unique_counts[,2]
colnames(expressionMatrix) <- colnames(unique_counts)[3:58]
minimalSet <- ExpressionSet(assayData=expressionMatrix)

#Create numerical matrix to produce a heatmap plot
heatmap_matrix <- unique_counts[,3:ncol(unique_counts)]
rownames(heatmap_matrix) <- unique_counts$ensembl_gene_id
colnames(heatmap_matrix) <- colnames(unique_counts[,3:ncol(unique_counts)])

#Row Normalization; scaling each row around the mean
heatmap_matrix <- t(scale(t(heatmap_matrix)))

#Changing colnames from patient-sample ids to the descriptive title
match_index <- match(colnames(heatmap_matrix), GSE113165_data[,"description.1"])
colnames(heatmap_matrix) <- GSE113165_data[match_index,"title"]

Now we begin the differential expression

#Creating a design matrix based on the condition that is being tested and accounting for subject variability (i.e. Pre and post bed rest)
model_ANOVA <- model.matrix(~ samples$type)

#Fit to model_design
fit_ANOVA <- lmFit(minimalSet, model_ANOVA)

#Computing differential expression for the model
#trend is TRUE as this is a RNAseq dataset
fit2_ANOVA <- eBayes(fit_ANOVA,trend=TRUE)

#Adjust fix based on Benjamin-Hochberg multiple hypothesis method
topfit_ANOVA <- topTable(fit2_ANOVA, 
                   coef=c(2:ncol(model_ANOVA)),
                   adjust.method = "BH",
                   number = nrow(expressionMatrix))

#Merge hgnc names to topfit table
output_hits_ANOVA <- merge(unique_counts[,1:2],
                           topfit_ANOVA,
                         by.y=0,by.x=2,
                         all.y=TRUE)
#Sort by P-value
output_hits_ANOVA <- output_hits_ANOVA[order(output_hits_ANOVA$P.Value),]

1. Calculate p-values for each of the genes in your expression set. How many genes were significantly differentially expressed? What thresholds did you use and why?

The P-values are generated by limma::eBayes().

(head(output_hits_ANOVA[,c("hgnc_symbol","ensembl_gene_id","P.Value")]))

Based on the code below, we can see that 2241 genes are significantly differentially expressed and within our threshold (< 0.05). We use the general rule that a p-value of greater than 0.05 represents the probability that the null hypothesis is true, therefore a p-value of less than 0.05 is statistically significant.

#How many gene pass the threshold p-value < 0.05?
length(which(output_hits_ANOVA$P.Value < 0.05))
## [1] 2241

2. Multiple hypothesis testing - correct your p-values using a multiple hypothesis correction method. Which method did you use? And Why? How many genes passed correction?

Based on the default adjust.method parameter for edgeR::topTags(),the Benjamin-Hochberg method is used for multiple correction. This method is used over the others as a default and for this analysis, as it represents the false discovery rate amoungst the differentially expressed genes, which is less “stringent” than familywise error rate and thus more powerful.

From the code below, we can see that 163 gene pass correction.

#How many genes pass correction?
length(which(output_hits_ANOVA$adj.P.Val < 0.05))
## [1] 163

3. Show the amount of differentially expressed genes using an MA Plot or a Volcano plot. Highlight genes of interest.

Here we have the volcano plot representing the differentially expressed genes within our dataset. The x-axis bound represents a p-value of 0.05.

volcanoplot(fit2_ANOVA, coef=c(2:ncol(model_ANOVA)), style = "p-value",
            xlab = "log2 Fold Change", ylab = NULL, pch=16, cex=0.35, xlim=c(-5, 5), ylim=c(0, 6), highlight = 10, hl.col = "blue", names=rownames(fit_ANOVA))
abline(h=1.3,v=0, lty=2, col="red")

The top differentially expressed genes are as follows (without taking into account valid p-value and adjusted p-value),

#Based on topTable
head(topfit_ANOVA, n=10)
#Based on P-value
head(output_hits_ANOVA, n=10)

4. Visualize your top hits using a heatmap. Do you conditions cluster together? Explain why or why not.

There is subtle clustering in this heatmap, but it’s not very clear.

top_hits <- output_hits_ANOVA$ensembl_gene_id[output_hits_ANOVA$P.Value<0.05]
heatmap_matrix_tophits <- t(
  scale(t(heatmap_matrix[
    which(rownames(heatmap_matrix) %in% top_hits),])))

#Grouping samples based on time condition (i.e. pre or post 5 days of bed rest) and susceptability to disuse-induced insulin resistance (i.e. low, high, if available)
heatmap_matrix_tophits<- heatmap_matrix_tophits[,
                                                c(grep(colnames(heatmap_matrix_tophits),pattern = "pre\\_low"),
                                                  grep(colnames(heatmap_matrix_tophits),pattern = "pre\\_high"),
                                                  grep(colnames(heatmap_matrix_tophits),pattern = "pre$"),
                                                  grep(colnames(heatmap_matrix_tophits),pattern = "post\\_low"),
                                                  grep(colnames(heatmap_matrix_tophits),pattern = "post\\_high"),
                                                  grep(colnames(heatmap_matrix_tophits),pattern = "post$"))]
if(min(heatmap_matrix_tophits) == 0){
  heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)), 
                           c( "white", "red"))
} else {
  heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red"))
}

#Create subtype annotation vector based on matrix columns for a top annotation bar
typeAnno <- c()
susAnno <- c()
for (x in 1:ncol(heatmap_matrix_tophits)){
  typeAnno <- append(typeAnno, samples[which(samples$id == lapply(strsplit(colnames(heatmap_matrix_tophits)[x], "_"), `[[`, 1)), 7])
  susAnno <- append(susAnno, as.character(samples[which(samples$id == lapply(strsplit(colnames(heatmap_matrix_tophits)[x], "_"), `[[`, 1)), 5]))
}

n <- HeatmapAnnotation(type = typeAnno, susceptibility = susAnno)

#Create and display heatmap
time_sus_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
                           cluster_rows = TRUE,
                           cluster_columns = FALSE,
                           show_row_dend = TRUE,
                           show_column_dend = FALSE, 
                           col=heatmap_col,
                           show_column_names = FALSE, 
                           show_row_names = FALSE,
                           show_heatmap_legend = TRUE,
                           top_annotation = n
)
time_sus_heatmap

I notice that if the patients are also sorted/grouped by age, there does seem to be subtle clustering of expression within the age groups (i.e. YOUNG, OLD) and time conditions (i.e. pre bed rest and post 5 days of bed rest). This clustering is supported by statements in the paper that aging corresponds to “biochemical, structural, and functional alterations in skeletal muscle” which therefore would reflect when comparing the expression of young and old patients samples. The clustering of the two time conditions are more prevalent in the old group, which is also supported by the paper.

#Rename samples with descriptive labels for sorting (seen below)
colnames(heatmap_matrix_tophits) <- GSE113165_data[match_index,"title"]

#Grouping samples based on time condition (i.e. pre or post 5 days of bed rest), susceptability to disuse-induced insulin resistance (i.e. low, high, if available) and age.
heatmap_matrix_tophits<- heatmap_matrix_tophits[,
                                                c(grep(colnames(heatmap_matrix_tophits),pattern = "young.*pre\\_low"),
                                                  grep(colnames(heatmap_matrix_tophits),pattern = "young.*pre\\_high"),
                                                  grep(colnames(heatmap_matrix_tophits),pattern = "young.*post\\_low"),
                                                  grep(colnames(heatmap_matrix_tophits),pattern = "young.*post\\_high"),
                                                  grep(colnames(heatmap_matrix_tophits),pattern = "old.*pre\\_low"),
                                                  grep(colnames(heatmap_matrix_tophits),pattern = "old.*pre\\_high"),
                                                  grep(colnames(heatmap_matrix_tophits),pattern = "old.*pre$"),
                                                  
                                                  grep(colnames(heatmap_matrix_tophits),pattern = "old.*post\\_low"),
                                                  grep(colnames(heatmap_matrix_tophits),pattern = "old.*post\\_high"), 
                                                  grep(colnames(heatmap_matrix_tophits),pattern = "old.*post$"))]
if(min(heatmap_matrix_tophits) == 0){
  heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)), 
                           c( "white", "red"))
} else {
  heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red"))
}

#Create subtype annotation vector based on matrix columns for a top annotation bar
typeAnno <- c()
susAnno <- c()
for (x in 1:ncol(heatmap_matrix_tophits)){
  typeAnno <- append(typeAnno, samples[which(samples$id == lapply(strsplit(colnames(heatmap_matrix_tophits)[x], "_"), `[[`, 1)), 7])
  susAnno <- append(susAnno, as.character(samples[which(samples$id == lapply(strsplit(colnames(heatmap_matrix_tophits)[x], "_"), `[[`, 1)), 5]))
}

n <- HeatmapAnnotation(type = typeAnno, susceptibility = susAnno)

#Create and display heatmap
time_age_sus_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
                           cluster_rows = TRUE,
                           cluster_columns = FALSE,
                           show_row_dend = TRUE,
                           show_column_dend = TRUE, 
                           col=heatmap_col,
                           show_column_names = FALSE, 
                           show_row_names = FALSE,
                           show_heatmap_legend = TRUE,
                           top_annotation = n
)
time_age_sus_heatmap

Thresholded over-representation analysis

#Read in supplementary files
sfiles = getGEOSuppFiles('GSE70072')
fnames = rownames(sfiles)
my_exp = read.delim(fnames[1],header=TRUE,
                       check.names = FALSE) 

#Merge gene names with the top hits
output_hits_ANOVA_withgn <- ""
output_hits_ANOVA_withgn <- merge(my_exp[,1],output_hits_ANOVA, by.x=1, by.y = 2)
output_hits_ANOVA_withgn[,c("old_pre_rank", "young_post_rank", "young_pre_rank")] <- -log(output_hits_ANOVA_withgn$P.Value,base =10) * sign(output_hits_ANOVA_withgn[,c("samples.typeold_pre", "samples.typeyoung_post", "samples.typeyoung_pre")])

#Hits based on logFC values for each grouping/type
output_hits_old_pre_rank <- output_hits_ANOVA_withgn[order(output_hits_ANOVA_withgn$old_pre_rank),]
output_hits_young_post_rank <- output_hits_ANOVA_withgn[order(output_hits_ANOVA_withgn$young_post_rank),]
output_hits_young_pre_rank <- output_hits_ANOVA_withgn[order(output_hits_ANOVA_withgn$young_pre_rank),]

#Type-"old_pre" up/down regulated genes
old_pre_upregulated_genes <- output_hits_old_pre_rank$hgnc_symbol[
  which(output_hits_old_pre_rank$P.Value < 0.05 
        & output_hits_old_pre_rank$samples.typeold_pre > 0)]
old_pre_downregulated_genes <- output_hits_old_pre_rank$hgnc_symbol[
  which(output_hits_old_pre_rank$P.Value < 0.05 
        & output_hits_old_pre_rank$samples.typeold_pre < 0)]

#Write gene names to file for future reference
write.table(x=old_pre_upregulated_genes,
            file=file.path("data","old_pre_upregulated_genes.txt"),sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=old_pre_downregulated_genes,
            file=file.path("data","old_pre_downregulated_genes.txt"),sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)


#Type-"young_post" up/down regulated genes
young_post_upregulated_genes <- output_hits_young_post_rank$hgnc_symbol[
  which(output_hits_young_post_rank$P.Value < 0.05 
        & output_hits_young_post_rank$samples.typeyoung_post > 0)]
young_post_downregulated_genes <- output_hits_young_post_rank$hgnc_symbol[
  which(output_hits_young_post_rank$P.Value < 0.05 
        & output_hits_young_post_rank$samples.typeyoung_post < 0)]

#Write gene names to file for future reference
write.table(x=young_post_upregulated_genes,
            file=file.path("data","young_post_upregulated_genes.txt"),sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=young_post_downregulated_genes,
            file=file.path("data","young_post_downregulated_genes.txt"),sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)

#Type-"young_pre" up/down regulated genes
young_pre_upregulated_genes <- output_hits_young_pre_rank$hgnc_symbol[
  which(output_hits_young_pre_rank$P.Value < 0.05 
        & output_hits_young_pre_rank$samples.typeyoung_pre > 0)]
young_pre_downregulated_genes <- output_hits_young_pre_rank$hgnc_symbol[
  which(output_hits_young_pre_rank$P.Value < 0.05 
        & output_hits_young_pre_rank$samples.typeyoung_pre < 0)]

#Write gene names to file for future reference
write.table(x=young_pre_upregulated_genes,
            file=file.path("data","young_pre_upregulated_genes.txt"),sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=young_pre_downregulated_genes,
            file=file.path("data","young_pre_downregulated_genes.txt"),sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)

1. Which method did you choose and why?

There was a pretty substantial difference between the limma and quasi methods, so I chose the limma results as it seemed to align best with the results of the paper and more refined (less genes).

2. What annotation data did you use and why? What version of the annotation are you using?

I decided to use g:Profiler, in particular the gprofiler2 package, as it is familiar to me and also provides a wide variety of sources to get a good overview and grasp of the processess effected by the up and down regulated genes. “The package corresponds to the 2019 update of g:Profiler and provides access for versions e94_eg41_p11 and higher.”

3. How many genesets were returned with what thresholds?

Number of genes in rank list for old_pre…

nrow(output_hits_old_pre_rank)
## [1] 14820
#Preview of table/thresholds
head(output_hits_old_pre_rank)

Number of genes in rank list for young_post…

#Number of genes in rank list for young_post
nrow(output_hits_young_post_rank)
## [1] 14820
#Preview of table/thresholds
head(output_hits_young_post_rank)

Number of genes in rank list for young_pre…

#Number of genes in rank list for young_pre
nrow(output_hits_young_pre_rank)
## [1] 14820
#Preview of table/thresholds
head(output_hits_young_pre_rank)

4. Run the analysis using the up-regulated set of genes, and the down-regulated set of genes separately. How do these results compare to using the whole list?

I only analyzed the young post-bed rest samples for simplicity’s sake.

#g:Profiler queries

#Multi-query: young_post
gostres2 <- gost(query = c(young_post_upregulated_genes,young_post_downregulated_genes), organism = "hsapiens", sources = c("HPA","HP"))
gostres2up <- gost(query = young_post_upregulated_genes, organism = "hsapiens", sources = c("HPA","HP"))
gostres2down <- gost(query = young_post_downregulated_genes, organism = "hsapiens", sources = c("HPA", "HP"))


#Print tables
pt2 <- publish_gosttable(head(gostres2, n = 10), 
                         highlight_terms = gostres2$result[c(1:20),],
                           use_colors = TRUE, 
                           show_columns = c("term_id", "term_name"),
                           filename = NULL)

pt2up <- publish_gosttable(head(gostres2up, n = 10), 
                           highlight_terms = gostres2up$result[c(1:20),],
                         use_colors = TRUE, 
                         show_columns = c("term_id", "term_name"),
                         filename = NULL)

pt2down <- publish_gosttable(head(gostres2down, n = 10), 
                             highlight_terms = gostres2down$result[c(1:20),],
                         use_colors = TRUE, 
                         show_columns = c("term_id", "term_name"),
                         filename = NULL)

Interpretation

1. Do the over-representation results support conclusions or mechanism discussed in the original paper?

Based on the descriptions given by the g:profiler query, it seems that it results do support conclusions discussed in the original paper. Many of the genes are involved in processes that focus on estrogen and female reproductive organs, which also appear in the paper. Not all of the genes were queried above, however the downregulated genes for the young post-bed rest samples did not produce accurate results - this is likely due to my source specification.

2. Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your results.

In a 2018 study by Dickinson JM et. al. in which the transcriptome of leg skeletal muscle before and after divergent exercise stimuli was observed, the results highlight a notable “increase in estrogen-related receptor-γ”. However it is difficult to find more papers to support this finding. It is likely the presence of other highlighted genes in the paper that supports the change in the transcriptome of vastus lateralis cells.

References

[1] Davis S, Meltzer P (2007). “GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor.” Bioinformatics, 14, 1846–1847.

[2] Durinck S, Spellman P, Birney E, Huber W (2009). “Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt.” Nature Protocols, 4, 1184–1191.

[3] Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A, Huber W (2005). “BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis.” Bioinformatics, 21, 3439–3440.

[4] Robinson MD, McCarthy DJ, Smyth GK (2010). “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics, 26(1), 139-140. doi: 10.1093/bioinformatics/btp616.

[5] McCarthy DJ, Chen Y, Smyth GK (2012). “Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation.” Nucleic Acids Research, 40(10), 4288-4297. doi: 10.1093/nar/gks042.

[6] Yihui Xie (2020). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.27.

[7] Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963

[8] Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research.Chapman and Hall/CRC. ISBN 978-1466561595

[9] Gu, Z. (2016) Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics.

[10] Gu, Z. (2014) circlize implements and enhances circular visualization in R. Bioinformatics.

[11] Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2020). dplyr: A Grammar of Data Manipulation. R package version 0.8.4. https://CRAN.R-project.org/package=dplyr

[12] Mahmassani ZS, Reidy PT, McKenzie AI, Stubben C et al. Age-dependent skeletal muscle transcriptome response to bed rest-induced atrophy. J Appl Physiol (1985) 2019 Apr 1;126(4):894-902. PMID: 30605403

[13] Mahmassani ZS, Reidy PT, McKenzie AI, Stubben C et al. Disuse-induced insulin resistance susceptibility coincides with a dysregulated skeletal muscle metabolic transcriptome. J Appl Physiol (1985) 2019 May 1;126(5):1419-1429. PMID: 30763167

[14] Isserlin R, BCB420-lectures-public, (2020), GitHub repository, https://github.com/risserlin/BCB420-lectures-public

[15] https://support.bioconductor.org/p/12441/

[16] https://support.bioconductor.org/p/18967/

[17] Liis Kolberg and Uku Raudvere (2019). gprofiler2: Interface to the ‘g:Profiler’ Toolset. R package version 0.1.8. https://CRAN.R-project.org/package=gprofiler2

[18] Dickinson JM, D’Lugos AC, Naymik MA, Siniard AL et al. Transcriptome response of human skeletal muscle to divergent exercise stimuli. J Appl Physiol (1985) 2018 Jun 1;124(6):1529-1540. PMID: 29543133